Chapter 6.2: Centrality – Vector Encoding¶

This notebook supplements Chapter 6.2, 'Centrality'.

Imports¶

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
from tqdm.notebook import tqdm
from resources_geschichtslyrik import *

from sklearn.preprocessing import OneHotEncoder
from sklearn.manifold import MDS
from itertools import product

from scipy.spatial.distance import cityblock, euclidean, cosine
In [2]:
pd.set_option('display.max_colwidth', None)
In [3]:
meta = pd.read_json(r"../resources/meta.json")

Corpora¶

In [4]:
meta_anth = (
    meta
    .query("corpus=='anth'")
    .query("1850 <= year <= 1918")
    .query("geschichtslyrik == 1")
    .drop_duplicates(subset='author_title')
    .reset_index(drop = True)
)
In [5]:
modcanon_authors = ['Hofmannsthal, Hugo von', 'Rilke, Rainer Maria', 'George, Stefan', 'Heym, Georg']

meta_modcanon = (
    meta
    .query("author in @modcanon_authors")
    .query("1850 <= year <= 1918")
    .query("geschichtslyrik == 1")
    .drop_duplicates(subset='author_title')
    .reset_index(drop = True)
)
In [6]:
muench_authors = ['Münchhausen, Börries von', 'Miegel, Agnes', 'Strauß und Torney, Lulu von']

meta_muench = (
    meta
    .query("author in @muench_authors")
    .query("1850 <= year <= 1918")
    .query("geschichtslyrik == 1")
    .drop_duplicates(subset='author_title')
    .reset_index(drop = True)
)
In [7]:
meta_all = pd.concat([meta_anth, meta_modcanon, meta_muench])
meta_all = meta_all.drop_duplicates(subset = 'id')
meta_all = meta_all.reset_index(drop = True)

meta_all['korpus_anth'] = meta_all['author_title'].isin(meta_anth['author_title'])
meta_all['korpus_modcanon'] = meta_all['author'].isin(modcanon_authors)
meta_all['korpus_muench'] = meta_all['author'].isin(muench_authors)

meta_all.shape[0]
Out[7]:
2063

Ratings¶

In [8]:
stoffgebiet_ratings = get_rating_table(meta = meta_all, mode = 'themes')
entity_ratings = get_rating_table(meta = meta_all, mode = 'entity')

Feature Overview¶

  • features_all_df : DataFrame listing all features to be used (name, encoding, weight)
In [9]:
features_data = [
    ['geschichtslyrik', 'ordinal', 1],
    ['empirisch', 'bin', 1],
    ['theoretisch', 'bin', 1],
    ['gattung', 'nominal_multi', 1],
    ['sprechinstanz_markiert', 'bin', 1],
    ['sprechinstanz_in_vergangenheit', 'nominal', 1],
    ['sprechakte', 'nominal_multi', 1],
    ['tempus', 'nominal_multi', 1],
    ['konkretheit', 'ordinal', 1],
    ['wissen', 'nominal', 1], # nominal rather than ordinal here, among other reasons because of ambivalent vs. neutral
    ['vergangenheitsdominant', 'ordinal', 1],
    ['zeitebenen', 'interval', 1],
    ['fixierbarkeit', 'bin', 1],
    ['beginn', 'interval', 1], # interval is a simplification, cf. NaN
    ['ende', 'interval', 1], # interval is a simplification, cf. NaN
    ['anachronismus', 'bin', 1],
    ['gegenwartsbezug', 'bin', 1],
    ['grossraum', 'nominal_multi', 1],
    ['mittelraum', 'nominal_multi', 1],
    ['kleinraum', 'nominal_multi', 1],
    ['inhaltstyp', 'nominal_multi', 1],
    ['stoffgebiet', 'nominal_multi_sim', 1],
    ['stoffgebiet_bewertung', 'nominal_multi_dependent', 1], # nominal rather than ordinal here, among other reasons because of ambivalent vs. neutral
    ['entity_simple', 'nominal_multi', 1],
    ['entity_bewertung', 'nominal_multi_dependent', 1], # nominal rather than ordinal here, among other reasons because of ambivalent vs. neutral
    ['nationalismus', 'bin', 1],
    ['heroismus', 'bin', 1],
    ['religiositaet', 'bin', 1],
    ['marker_person', 'bin_multi', 1],
    ['marker_zeit', 'bin_multi', 1],
    ['marker_ort', 'bin_multi', 1],
    ['marker_objekt', 'bin_multi', 1],
    ['ueberlieferung', 'bin', 1],
    ['ueberlieferung_bewertung', 'nominal', 1], # nominal rather than ordinal here, among other reasons because of ambivalent vs. neutral
    ['geschichtsauffassung', 'bin', 1],
    ['geschichtsauffassung_bewertung', 'nominal', 1], # nominal rather than ordinal here, among other reasons because of ambivalent vs. neutral
    ['verhaeltnis_wissen', 'nominal_multi', 1], # nominal rather than ordinal here, among other reasons because of natural vs. supernatural
    ['reim', 'ordinal', 1],
    ['metrum', 'ordinal', 1],
    ['verfremdung', 'ordinal', 1],
]

features_all_df = pd.DataFrame(features_data, columns=['feature', 'encoding', 'weight'])
In [10]:
features_all_df.head()
Out[10]:
feature encoding weight
0 geschichtslyrik ordinal 1
1 empirisch bin 1
2 theoretisch bin 1
3 gattung nominal_multi 1
4 sprechinstanz_markiert bin 1
In [11]:
features_all_df['encoding'].value_counts()
Out[11]:
encoding
bin                        11
nominal_multi               9
ordinal                     6
nominal                     4
bin_multi                   4
interval                    3
nominal_multi_dependent     2
nominal_multi_sim           1
Name: count, dtype: int64
  • bin = binary (e.g. Gegenwartsbezug: yes/no)
  • bin_multi = binary with multiple annotations (e.g. person markers: title yes/no and text yes/no)
  • ordinal = ordinal (e.g. rhyme: none, partial, throughout)
  • nominal = nominal (e.g. time level of the speaker: unmarked, past, not past)
  • nominal_multi = nominal with multiple annotations (e.g. genre: 'Ballade', 'Ballade + Lied', etc.)
  • nominal_multi_dependent = nominal with multiple annotations referring to another annotation category (e.g. evaluation of subject areas: negative, neutral/ambivalent, positive)
  • nominal_multi_sim = nominal with multiple annotations, where some values can be more similar to each other than others (subject areas: 'Krieg', 'Krieg + Politik', etc.)
  • interval = interval-scaled (e.g. number of time levels: 0, 1, 2, 3, 4 ...)
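The distance computation itself comes later; as a minimal sketch (toy values, assuming features are already scaled to comparable ranges), once every feature is numeric, any of the imported distance measures can combine them:

```python
import numpy as np
from scipy.spatial.distance import cityblock

# two toy poems as vectors over three features:
# [gegenwartsbezug (bin), reim (ordinal, scaled to 0..1), zeitebenen (interval, scaled)]
poem_a = np.array([1.0, 0.5, 0.50])
poem_b = np.array([0.0, 1.0, 0.75])

# equal weights per feature, as in features_all_df
weights = np.array([1.0, 1.0, 1.0])
dist = cityblock(weights * poem_a, weights * poem_b)  # 1.0 + 0.5 + 0.25 = 1.75
```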

Features Already in the Proper Encoding¶

  • features_used_df : DataFrame listing all features that are already in the 'right' encoding. Filled incrementally.
  • features_used : Series with the names of the features that are already in the 'right' encoding
In [12]:
features_proper_encoding = features_all_df.query("encoding == 'bin' | encoding == 'ordinal' | encoding == 'interval'")
features_proper_encoding = features_proper_encoding['feature']

exceptions = [
    'beginn', 'ende'
]
features_proper_encoding = [x for x in features_proper_encoding if x not in exceptions]
In [13]:
features_used_df = features_all_df.query("feature in @features_proper_encoding")

features_used_df
Out[13]:
feature encoding weight
0 geschichtslyrik ordinal 1
1 empirisch bin 1
2 theoretisch bin 1
4 sprechinstanz_markiert bin 1
8 konkretheit ordinal 1
10 vergangenheitsdominant ordinal 1
11 zeitebenen interval 1
12 fixierbarkeit bin 1
15 anachronismus bin 1
16 gegenwartsbezug bin 1
25 nationalismus bin 1
26 heroismus bin 1
27 religiositaet bin 1
32 ueberlieferung bin 1
34 geschichtsauffassung bin 1
37 reim ordinal 1
38 metrum ordinal 1
39 verfremdung ordinal 1
In [14]:
features_used = features_used_df['feature']
In [15]:
meta_all[['author', 'title'] + features_used.tolist()].sample(n=5)
Out[15]:
author title geschichtslyrik empirisch theoretisch sprechinstanz_markiert konkretheit vergangenheitsdominant zeitebenen fixierbarkeit anachronismus gegenwartsbezug nationalismus heroismus religiositaet ueberlieferung geschichtsauffassung reim metrum verfremdung
657 Gerok, Karl Heinrichs I. Wahl 1.0 1.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0
1768 Hohlbaum, Robert Luther 1.0 1.0 1.0 1.0 0.0 1.0 2.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 1.0 0.5
1630 Müller von Königswinter, Wolfgang Das Zepter Rudolfs von Habsburg 1.0 1.0 0.0 0.0 1.0 1.0 2.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0
843 Bergmann, Werner Bei Rexpoede 1.0 1.0 0.0 1.0 1.0 1.0 2.0 1.0 0.0 1.0 1.0 1.0 0.0 1.0 0.0 1.0 1.0 0.0
1981 Münchhausen, Börries von Le Ralli 1.0 1.0 0.0 0.0 1.0 1.0 2.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0

All Remaining Features¶

In [16]:
def OneHotSimple(data, category_name): # for nominal
    encoder = OneHotEncoder(sparse_output=False)
    onehot_df = pd.DataFrame(encoder.fit_transform(np.array(data).reshape(-1, 1)))

    # column order follows encoder.categories_ (sorted), so derive the names from it directly
    onehot_df.columns = [category_name + "_" + str(x) for x in encoder.categories_[0]]

    return onehot_df
In [17]:
def OneHotMulti(data, category_name): # for nominal_multi
    # collect all distinct values across the " + "-joined multi-label annotations
    values = pd.Series([v for entry in data for v in entry.split(" + ")]).unique()

    # count list membership per text (not substring occurrences, which could
    # overcount when one value is a substring of another)
    onehot_df = pd.DataFrame({
        category_name + "_" + value: [entry.split(" + ").count(value) for entry in data]
        for value in values
    }, index=data.index)

    return onehot_df
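The multi-label expansion can be illustrated standalone (self-contained sketch with toy genre labels, not calling the notebook's helper):

```python
import pandas as pd

# multi-label annotations joined with " + ", as in the 'gattung' column
data = pd.Series(["Ballade", "Ballade + Lied", "Lied", "None"])

# one column per distinct label; each cell counts how often the label occurs
values = pd.Series([v for entry in data for v in entry.split(" + ")]).unique()
onehot_df = pd.DataFrame({
    "gattung_" + value: [entry.split(" + ").count(value) for entry in data]
    for value in values
})
```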

gattung [nominal_multi]: One-Hot Encoding¶

In [18]:
meta_all['gattung'] = [str(x) for x in meta_all['gattung']]
data = meta_all['gattung']
In [19]:
onehot_df = OneHotMulti(data, 'gattung')
In [20]:
onehot_df
Out[20]:
gattung_Ballade gattung_Sonett gattung_Lied gattung_None gattung_Rollengedicht gattung_Denkmal-/Ruinenpoesie
0 1 0 0 0 0 0
1 0 1 0 0 0 0
2 1 0 0 0 0 0
3 1 0 0 0 0 0
4 0 0 1 0 0 0
... ... ... ... ... ... ...
2058 0 0 0 1 0 0
2059 0 0 0 1 0 0
2060 0 0 0 1 0 0
2061 1 0 0 0 0 0
2062 0 0 0 1 0 0

2063 rows × 6 columns

In [21]:
if onehot_df.columns[0] not in meta_all.columns:
    meta_all = meta_all.join(onehot_df).copy()

meta_all[[
    'author', 'title',
    'gattung',
    'gattung_Ballade',
    'gattung_Lied',
    'gattung_Denkmal-/Ruinenpoesie',
    'gattung_Rollengedicht',
    'gattung_Sonett',
    'gattung_None'
]].sample(n=5)
Out[21]:
author title gattung gattung_Ballade gattung_Lied gattung_Denkmal-/Ruinenpoesie gattung_Rollengedicht gattung_Sonett gattung_None
407 Sturm, Julius Frau Elsa Ballade 1 0 0 0 0 0
1617 Lissauer, Ernst Aus dem Dreißigjährigen Kriege Ballade 1 0 0 0 0 0
1056 Köppen, Fedor von Mein deutsches Volk, o denke dran! None 0 0 0 0 0 1
2034 Münchhausen, Börries von Bayard. Der Ritterschlag Ballade 1 0 0 0 0 0
873 Hosäus, Wilhelm Albrecht der Bär Ballade 1 0 0 0 0 0
In [22]:
column_names = onehot_df.columns

features_used_add = pd.DataFrame({
    'feature' : column_names,
    'encoding_orig' : ['nominal_multi'] * len(column_names),
    'encoding' : ['bin'] * len(column_names),
    'weight' : [1/len(column_names)] * len(column_names)
})

features_used_df = pd.concat([
    features_used_df,
    features_used_add
]).reset_index(drop = True)

features_used_df = features_used_df.drop_duplicates(subset = 'feature')
In [23]:
features_used_df
Out[23]:
feature encoding weight encoding_orig
0 geschichtslyrik ordinal 1.000000 NaN
1 empirisch bin 1.000000 NaN
2 theoretisch bin 1.000000 NaN
3 sprechinstanz_markiert bin 1.000000 NaN
4 konkretheit ordinal 1.000000 NaN
5 vergangenheitsdominant ordinal 1.000000 NaN
6 zeitebenen interval 1.000000 NaN
7 fixierbarkeit bin 1.000000 NaN
8 anachronismus bin 1.000000 NaN
9 gegenwartsbezug bin 1.000000 NaN
10 nationalismus bin 1.000000 NaN
11 heroismus bin 1.000000 NaN
12 religiositaet bin 1.000000 NaN
13 ueberlieferung bin 1.000000 NaN
14 geschichtsauffassung bin 1.000000 NaN
15 reim ordinal 1.000000 NaN
16 metrum ordinal 1.000000 NaN
17 verfremdung ordinal 1.000000 NaN
18 gattung_Ballade bin 0.166667 nominal_multi
19 gattung_Sonett bin 0.166667 nominal_multi
20 gattung_Lied bin 0.166667 nominal_multi
21 gattung_None bin 0.166667 nominal_multi
22 gattung_Rollengedicht bin 0.166667 nominal_multi
23 gattung_Denkmal-/Ruinenpoesie bin 0.166667 nominal_multi
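The 1/k weights keep each expanded feature on an equal footing with the unexpanded ones: the six gattung_* columns together carry the same total weight as a single bin or ordinal feature. A quick check (hypothetical mini table mirroring features_used_df):

```python
import pandas as pd

mini = pd.DataFrame({
    "feature": ["reim", "gattung_Ballade", "gattung_Sonett", "gattung_Lied"],
    "weight": [1.0, 1 / 3, 1 / 3, 1 / 3],
})

# group expanded columns back to their source feature via the name prefix
source = mini["feature"].str.split("_").str[0]
totals = mini.groupby(source)["weight"].sum()
# both 'reim' and 'gattung' total 1.0
```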

sprechinstanz_in_vergangenheit [nominal]: One-Hot Encoding¶

In [24]:
data = meta_all['sprechinstanz_in_vergangenheit']
data = data.replace({float('NaN') : 'unmarkiert', 0 : 'gegenwart', 1 : 'vergangenheit'})
In [25]:
onehot_df = OneHotSimple(data, 'sprechinstanz_in_vergangenheit')
In [26]:
if onehot_df.columns[0] not in meta_all.columns:
    meta_all = meta_all.join(onehot_df).copy()

meta_all[[
    'sprechinstanz_in_vergangenheit',
    'sprechinstanz_in_vergangenheit_unmarkiert',
    'sprechinstanz_in_vergangenheit_vergangenheit',
    'sprechinstanz_in_vergangenheit_gegenwart'
]].sample(n=10)
Out[26]:
sprechinstanz_in_vergangenheit sprechinstanz_in_vergangenheit_unmarkiert sprechinstanz_in_vergangenheit_vergangenheit sprechinstanz_in_vergangenheit_gegenwart
458 0.0 0.0 0.0 1.0
1759 NaN 1.0 0.0 0.0
1289 NaN 1.0 0.0 0.0
2007 1.0 0.0 1.0 0.0
682 NaN 1.0 0.0 0.0
378 NaN 1.0 0.0 0.0
1967 NaN 1.0 0.0 0.0
1517 NaN 1.0 0.0 0.0
1325 1.0 0.0 1.0 0.0
1504 NaN 1.0 0.0 0.0
In [27]:
column_names = onehot_df.columns

features_used_add = pd.DataFrame({
    'feature' : column_names,
    'encoding_orig' : ['nominal'] * len(column_names),
    'encoding' : ['bin'] * len(column_names),
    'weight' : [1/len(column_names)] * len(column_names)
})
features_used_df = pd.concat(
    [features_used_df,
     features_used_add
    ]).reset_index(drop = True)

features_used_df = features_used_df.drop_duplicates(subset = 'feature')

sprechakte [nominal_multi]: One-Hot Encoding¶

In [28]:
meta_all['sprechakte'] = [str(x) for x in meta_all['sprechakte']]
data = meta_all['sprechakte']
In [29]:
onehot_df = OneHotMulti(data, 'sprechakte')
In [30]:
if onehot_df.columns[0] not in meta_all.columns:
    meta_all = meta_all.join(onehot_df).copy()

meta_all[[
    'sprechakte',
    'sprechakte_Erzählen',
    'sprechakte_Beschreiben',
    'sprechakte_Auffordern',
]].sample(n=10)
Out[30]:
sprechakte sprechakte_Erzählen sprechakte_Beschreiben sprechakte_Auffordern
41 Erzählen 1 0 0
1056 Auffordern + Beschreiben 0 1 1
299 Erzählen 1 0 0
148 Auffordern + Beschreiben 0 1 1
495 Erzählen 1 0 0
332 Erzählen 1 0 0
23 Erzählen 1 0 0
1239 Erzählen 1 0 0
644 Erzählen 1 0 0
1026 Auffordern + Beschreiben + Erzählen 1 1 1
In [31]:
column_names = onehot_df.columns

features_used_add = pd.DataFrame({
    'feature' : column_names,
    'encoding_orig' : ['nominal_multi'] * len(column_names),
    'encoding' : ['bin'] * len(column_names),
    'weight' : [1/len(column_names)] * len(column_names)
})
features_used_df = pd.concat(
    [features_used_df,
     features_used_add
    ]).reset_index(drop = True)

features_used_df = features_used_df.drop_duplicates(subset = 'feature')

tempus [nominal_multi]: One-Hot Encoding¶

In [32]:
meta_all['tempus'] = [str(x) for x in meta_all['tempus']]
data = meta_all['tempus']
In [33]:
onehot_df = OneHotMulti(data, 'tempus')
In [34]:
if onehot_df.columns[0] not in meta_all.columns:
    meta_all = meta_all.join(onehot_df).copy()

meta_all[[
    'tempus',
    'tempus_Präsens',
    'tempus_Präteritum',
    'tempus_Futur',
]].sample(n=5)
Out[34]:
tempus tempus_Präsens tempus_Präteritum tempus_Futur
22 Präsens + Präteritum 1 1 0
1066 Präsens + Präteritum 1 1 0
257 Präsens + Präteritum 1 1 0
1250 Präsens 1 0 0
360 Präsens 1 0 0
In [35]:
column_names = onehot_df.columns

features_used_add = pd.DataFrame({
    'feature' : column_names,
    'encoding_orig' : ['nominal_multi'] * len(column_names),
    'encoding' : ['bin'] * len(column_names),
    'weight' : [1/len(column_names)] * len(column_names)
})
features_used_df = pd.concat(
    [features_used_df,
     features_used_add
    ]).reset_index(drop = True)

features_used_df = features_used_df.drop_duplicates(subset = 'feature')

wissen [nominal]: One-Hot Encoding¶

In [36]:
# Nominal approach
data = meta_all['wissen']
data = data.replace({float('NaN') : 'neutral', 0 : 'ambivalent', 1 : 'wissend', -1 : 'unwissend'})

onehot_df = OneHotSimple(data, 'wissen')

if onehot_df.columns[0] not in meta_all.columns:
    meta_all = meta_all.join(onehot_df).copy()

meta_all[[
    'wissen',
    'wissen_neutral',
    'wissen_wissend',
    'wissen_unwissend',
    'wissen_ambivalent',
]].sample(n=10)
Out[36]:
wissen wissen_neutral wissen_wissend wissen_unwissend wissen_ambivalent
1041 NaN 1.0 0.0 0.0 0.0
73 NaN 1.0 0.0 0.0 0.0
1261 NaN 1.0 0.0 0.0 0.0
837 NaN 1.0 0.0 0.0 0.0
1707 NaN 1.0 0.0 0.0 0.0
1209 0.0 0.0 0.0 0.0 1.0
132 NaN 1.0 0.0 0.0 0.0
1371 NaN 1.0 0.0 0.0 0.0
1645 NaN 1.0 0.0 0.0 0.0
605 NaN 1.0 0.0 0.0 0.0
In [37]:
# Ordinal approach

# meta_all['wissen'] = meta_all['wissen'].replace({float('NaN') : 0})
In [38]:
column_names = onehot_df.columns

features_used_add = pd.DataFrame({
    'feature' : column_names,
    'encoding_orig' : ['nominal'] * len(column_names),
    'encoding' : ['bin'] * len(column_names),
    'weight' : [1/len(column_names)] * len(column_names)
})
features_used_df = pd.concat(
    [features_used_df,
     features_used_add
    ]).reset_index(drop = True)

features_used_df = features_used_df.drop_duplicates(subset = 'feature')

beginn [interval]: Standardize (replace NaN with the median)¶

In [39]:
meta_all.query("beginn.isna()").shape[0]
Out[39]:
9
In [40]:
beginn_median = meta_all.query("korpus_anth and beginn.notna()")['beginn'].median()
beginn_median
Out[40]:
1521.0
In [41]:
meta_all['beginn'] = meta_all['beginn'].fillna(beginn_median)
In [42]:
meta_all.query("beginn.isna()").shape[0]
Out[42]:
0
In [43]:
features_used_add = pd.DataFrame({
    'feature' : ['beginn'],
    'encoding_orig' : ['interval'],
    'encoding' : ['interval'],
    'weight' : [1]
})
features_used_df = pd.concat(
    [features_used_df,
     features_used_add
    ]).reset_index(drop = True)

features_used_df = features_used_df.drop_duplicates(subset = 'feature')

ende [interval]: Standardize (replace NaN with the median)¶

In [44]:
meta_all.query("ende.isna()").shape[0]
Out[44]:
9
In [45]:
ende_median = meta_all.query("korpus_anth and ende.notna()")['ende'].median()
ende_median
Out[45]:
1523.0
In [46]:
meta_all['ende'] = meta_all['ende'].fillna(ende_median)
In [47]:
meta_all.query("ende.isna()").shape[0]
Out[47]:
0
In [48]:
features_used_add = pd.DataFrame({
    'feature' : ['ende'],
    'encoding_orig' : ['interval'],
    'encoding' : ['interval'],
    'weight' : [1]
})
features_used_df = pd.concat(
    [features_used_df,
     features_used_add
    ]).reset_index(drop = True)

features_used_df = features_used_df.drop_duplicates(subset = 'feature')

grossraum [nominal_multi]: One-Hot Encoding¶

In [49]:
meta_all['grossraum'] = [str(x) for x in meta_all['grossraum']]
data = meta_all['grossraum']
In [50]:
onehot_df = OneHotMulti(data, 'grossraum')
In [51]:
if onehot_df.columns[0] not in meta_all.columns:
    meta_all = meta_all.join(onehot_df).copy()

meta_all[[
    'grossraum',
    'grossraum_Europa',
    'grossraum_Asien',
    'grossraum_Afrika',
]].sample(n=10)
Out[51]:
grossraum grossraum_Europa grossraum_Asien grossraum_Afrika
1525 Europa 1 0 0
480 Europa 1 0 0
1070 Europa 1 0 0
728 Europa 1 0 0
280 Europa 1 0 0
278 Europa 1 0 0
844 Europa 1 0 0
1515 Europa 1 0 0
465 Europa 1 0 0
721 Europa 1 0 0
In [52]:
column_names = onehot_df.columns

features_used_add = pd.DataFrame({
    'feature' : column_names,
    'encoding_orig' : ['nominal_multi'] * len(column_names),
    'encoding' : ['bin'] * len(column_names),
    'weight' : [1/len(column_names)] * len(column_names)
})
features_used_df = pd.concat(
    [features_used_df,
     features_used_add
    ]).reset_index(drop = True)

features_used_df = features_used_df.drop_duplicates(subset = 'feature')

mittelraum [nominal_multi]: One-Hot Encoding¶

In [53]:
meta_all['mittelraum'] = [str(x) for x in meta_all['mittelraum']]
In [54]:
# mittelraum_deutsch = ["Heiliges Römisches Reich", "Proto-Deutschland", "Deutsches Kaiserreich", "Fränkisches Reich", "Germanien", 
#                       "Ostfränkisches Reich", "Deutschland", "Deutschordensstaat", "Deutscher Bund"]
# 
# for raum in mittelraum_deutsch:
#     meta_all['mittelraum'] = [re.sub(raum, "Deutscher Raum", x) for x in meta_all['mittelraum']]
In [55]:
data = meta_all['mittelraum']
In [56]:
onehot_df = OneHotMulti(data, 'mittelraum')
In [57]:
if onehot_df.columns[0] not in meta_all.columns:
    meta_all = meta_all.join(onehot_df).copy()

meta_all[[
    'mittelraum',
    'mittelraum_Heiliges Römisches Reich',
    'mittelraum_Frankreich',
    'mittelraum_None'
]].sample(n=10)
Out[57]:
mittelraum mittelraum_Heiliges Römisches Reich mittelraum_Frankreich mittelraum_None
902 Deutscher Raum 0 0 0
844 Deutscher Raum 0 0 0
720 Byzantinisches Reich 0 0 0
855 Deutscher Raum 0 0 0
145 Deutscher Raum 0 0 0
243 Großbritannien 0 0 0
1501 Kaisertum Österreich 0 0 0
159 Heiliges Römisches Reich 1 0 0
1786 Kaisertum Österreich 0 0 0
1678 Heiliges Römisches Reich 1 0 0
In [58]:
column_names = onehot_df.columns

features_used_add = pd.DataFrame({
    'feature' : column_names,
    'encoding_orig' : ['nominal_multi'] * len(column_names),
    'encoding' : ['bin'] * len(column_names),
    'weight' : [1/len(column_names)] * len(column_names)
})
features_used_df = pd.concat(
    [features_used_df,
     features_used_add
    ]).reset_index(drop = True)

features_used_df = features_used_df.drop_duplicates(subset = 'feature')

kleinraum [nominal_multi]: One-Hot Encoding¶

In [59]:
meta_all['kleinraum'] = [str(x) for x in meta_all['kleinraum']]
data = meta_all['kleinraum']
In [60]:
onehot_df = OneHotMulti(data, 'kleinraum')
In [61]:
if onehot_df.columns[0] not in meta_all.columns:
    meta_all = meta_all.join(onehot_df).copy()

meta_all[[
    'kleinraum',
    'kleinraum_Paris',
    'kleinraum_Berlin',
    'kleinraum_Wien',
    'kleinraum_None',
]].sample(n=10)
Out[61]:
kleinraum kleinraum_Paris kleinraum_Berlin kleinraum_Wien kleinraum_None
359 None 0 0 0 1
513 Wien 0 0 1 0
1634 Augsburg 0 0 0 0
1968 Paris 1 0 0 0
270 Bouvines 0 0 0 0
761 None 0 0 0 1
665 Mons Lactarius 0 0 0 0
757 Eresburg 0 0 0 0
1954 None 0 0 0 1
845 Kosel 0 0 0 0
In [62]:
column_names = onehot_df.columns

features_used_add = pd.DataFrame({
    'feature' : column_names,
    'encoding_orig' : ['nominal_multi'] * len(column_names),
    'encoding' : ['bin'] * len(column_names),
    'weight' : [1/len(column_names)] * len(column_names)
})
features_used_df = pd.concat(
    [features_used_df,
     features_used_add
    ]).reset_index(drop = True)

features_used_df = features_used_df.drop_duplicates(subset = 'feature')

inhaltstyp [nominal_multi]: One-Hot Encoding¶

In [63]:
meta_all['inhaltstyp'] = [str(x) for x in meta_all['inhaltstyp']]
data = meta_all['inhaltstyp']
In [64]:
onehot_df = OneHotMulti(data, 'inhaltstyp')
In [65]:
if onehot_df.columns[0] not in meta_all.columns:
    meta_all = meta_all.join(onehot_df).copy()

meta_all[[
    'inhaltstyp',
    'inhaltstyp_Ereignis',
    'inhaltstyp_Zustand',
]].sample(n=10)
Out[65]:
inhaltstyp inhaltstyp_Ereignis inhaltstyp_Zustand
1312 Zustand 0 1
223 Ereignis 1 0
1585 Ereignis + Zustand 1 1
696 Zustand 0 1
1981 Ereignis 1 0
1003 Zustand 0 1
71 Ereignis 1 0
1740 Ereignis 1 0
1342 Ereignis + Zustand 1 1
1976 Zustand 0 1
In [66]:
column_names = onehot_df.columns

features_used_add = pd.DataFrame({
    'feature' : column_names,
    'encoding_orig' : ['nominal_multi'] * len(column_names),
    'encoding' : ['bin'] * len(column_names),
    'weight' : [1/len(column_names)] * len(column_names)
})
features_used_df = pd.concat(
    [features_used_df,
     features_used_add
    ]).reset_index(drop = True)

features_used_df = features_used_df.drop_duplicates(subset = 'feature')

stoffgebiet [nominal_multi_sim]¶

Look up GermaNet synsets for the individual subject areas¶

In [67]:
from germanetpy.germanet import Germanet

data_path = "../resources/more/GN_V160_XML"

germanet = Germanet(data_path)
Load GermaNet data...: 100%|███▉| 99.99999999999996/100 [00:08<00:00, 12.04it/s]
Load Wiktionary data...: 100%|████████████| 100.0/100 [00:00<00:00, 1001.49it/s]
Load Ili records...: 100%|███████████████| 100.0/100 [00:00<00:00, 69155.88it/s]
In [68]:
stoffgebiete = stoffgebiet_ratings['type'].unique().tolist()
In [69]:
germanet_synsets = {}

germanet_synsets['Militär/Krieg'] = germanet.get_synsets_by_orthform("Krieg")[0]
germanet_synsets['Politik'] = germanet.get_synsets_by_orthform("Politik")[1]
germanet_synsets['Literatur'] = germanet.get_synsets_by_orthform("Literatur")[1]
germanet_synsets['Architektur'] = germanet.get_synsets_by_orthform("Architektur")[1]
germanet_synsets['Nation/Volk-iD'] = germanet.get_synsets_by_orthform("Region")[2]
germanet_synsets['Aufstand/Revolution'] = germanet.get_synsets_by_orthform("Aufstand")[0]
germanet_synsets['Erfindung/Innovation'] = germanet.get_synsets_by_orthform("Erfindung")[0]
germanet_synsets['Religion'] = germanet.get_synsets_by_orthform("Religion")[1]
germanet_synsets['Herrscherliches Handeln'] = germanet.get_synsets_by_orthform("herrschen")[1]
germanet_synsets['Arbeit'] = germanet.get_synsets_by_orthform("Arbeit")[2]
germanet_synsets['Ertrinken'] = germanet.get_synsets_by_orthform("ertrinken")[0]
germanet_synsets['Essen/Trinken'] = germanet.get_synsets_by_orthform("Nahrungsmittel")[0]
germanet_synsets['Jagd'] = germanet.get_synsets_by_orthform("Jagd")[1]
germanet_synsets['Musik'] = germanet.get_synsets_by_orthform("Musik")[2]
germanet_synsets['Recht'] = germanet.get_synsets_by_orthform("Recht")[2]
germanet_synsets['Nation/Volk-D'] = germanet.get_synsets_by_orthform("Deutschland")[0]
germanet_synsets['Auferstehung/Geister'] = germanet.get_synsets_by_orthform("Auferstehung")[0]
germanet_synsets['Denkmal'] = germanet.get_synsets_by_orthform("Denkmal")[1]
germanet_synsets['Sport'] = germanet.get_synsets_by_orthform("Sport")[1]
germanet_synsets['Waffen'] = germanet.get_synsets_by_orthform("Waffe")[0]
germanet_synsets['Ankunft'] = germanet.get_synsets_by_orthform("Ankunft")[1]
germanet_synsets['Natur'] = germanet.get_synsets_by_orthform("Natur")[3]
germanet_synsets['Nation/Volk-nD'] = germanet.get_synsets_by_orthform("Nation")[0]
germanet_synsets['Identitätsenthüllung'] = germanet.get_synsets_by_orthform("Identität")[1]
germanet_synsets['Kampf'] = germanet.get_synsets_by_orthform("Kampf")[3]
germanet_synsets['Eltern-Kind-Beziehung'] = germanet.get_synsets_by_orthform("Kindererziehung")[0]
germanet_synsets['Astronomie/Astrologie'] = germanet.get_synsets_by_orthform("Astronomie")[0]
germanet_synsets['Italiensehnsucht'] = germanet.get_synsets_by_orthform("Italien")[0]
germanet_synsets['Geburtstag'] = germanet.get_synsets_by_orthform("Geburtstag")[1]
germanet_synsets['Landwirtschaft'] = germanet.get_synsets_by_orthform("Landwirtschaft")[1]
germanet_synsets['Schlaf/Traum'] = germanet.get_synsets_by_orthform("Schlaf")[0]
germanet_synsets['Abschied'] = germanet.get_synsets_by_orthform("Abschied")[1]
germanet_synsets['Sprache'] = germanet.get_synsets_by_orthform("Sprache")[3]
germanet_synsets['Kindheit/Jugend'] = germanet.get_synsets_by_orthform("Kindheit")[0]

for stoffgebiet in stoffgebiete:
    if stoffgebiet not in germanet_synsets:
        synsets = germanet.get_synsets_by_orthform(stoffgebiet)
        if len(synsets) > 0:
            germanet_synsets[stoffgebiet] = synsets[0]
        else:
            germanet_synsets[stoffgebiet] = [] # no synset found; distances involving it become NaN below

Build a distance matrix between texts from the GermaNet distances between subject-area names (stoffgebiete_dist)¶

In [70]:
meta_all['stoffgebiet'] = [str(x) for x in meta_all['stoffgebiet']]
stoffgebiete_all = meta_all['stoffgebiet']
stoffgebiete_all = [x.split(" + ") for x in stoffgebiete_all]
In [71]:
def get_multi_dist(stoffgebiete_a, stoffgebiete_b):
    if stoffgebiete_a == stoffgebiete_b:
        return 0
    else:
        distances = []
        combinations = list(product(stoffgebiete_a, stoffgebiete_b))
        for combination in combinations:
            distance = get_single_dist(combination[0], combination[1])
            distances.append(distance)

        if all(np.isnan(x) for x in distances):
            return float('NaN')
        else:
            return np.nanmean(distances)

def get_single_dist(stoffgebiet_a, stoffgebiet_b):
    if stoffgebiet_a == stoffgebiet_b:
        return 0
    else:
        stoffgebiet_a = germanet_synsets[stoffgebiet_a]
        stoffgebiet_b = germanet_synsets[stoffgebiet_b]
        try:
            return stoffgebiet_a.shortest_path_distance(stoffgebiet_b)
        except Exception: # e.g. missing synset ([] above) or no connecting path
            return float('NaN')
In [72]:
get_single_dist('Militär/Krieg', 'Liebe')
Out[72]:
10
In [73]:
get_multi_dist(['Militär/Krieg'], ['Liebe'])
Out[73]:
10.0
In [74]:
get_multi_dist(['Militär/Krieg', 'Religion'], ['Liebe'])
Out[74]:
12.5
In [75]:
get_multi_dist(['Militär/Krieg', 'Religion'], ['Liebe', 'Religion', 'Militär/Krieg'])
Out[75]:
8.5
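The averaged values above can be checked by hand. Out[72]–[74] give d(Militär/Krieg, Liebe) = 10 and imply d(Religion, Liebe) = 15; together with Out[75] that implies d(Militär/Krieg, Religion) = 13. A self-contained re-computation of the last example (toy distance table, not calling get_multi_dist):

```python
import numpy as np
from itertools import product

# pairwise distances implied by the outputs above (order-independent keys)
toy_dist = {
    frozenset(["Militär/Krieg", "Liebe"]): 10,
    frozenset(["Religion", "Liebe"]): 15,
    frozenset(["Militär/Krieg", "Religion"]): 13,
}

def toy_multi_dist(a, b):
    # mean over all cross pairs; identical labels contribute 0
    return np.mean([0 if x == y else toy_dist[frozenset([x, y])]
                    for x, y in product(a, b)])

toy_multi_dist(["Militär/Krieg", "Religion"], ["Liebe", "Religion", "Militär/Krieg"])
# (10 + 13 + 0 + 15 + 0 + 13) / 6 = 8.5
```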
In [76]:
stoffgebiete_dist = np.empty((len(stoffgebiete_all), len(stoffgebiete_all)))
stoffgebiete_dists = {}

for i, stoffgebiete_a in tqdm(enumerate(stoffgebiete_all), total = len(stoffgebiete_all)):
    for j, stoffgebiete_b in enumerate(stoffgebiete_all):
        # create strings as ids
        stoffgebiete_ab_str = str(stoffgebiete_a)+"–"+str(stoffgebiete_b)
        stoffgebiete_ba_str = str(stoffgebiete_b)+"–"+str(stoffgebiete_a)

        # lookup
        if stoffgebiete_ab_str in stoffgebiete_dists:
            distance = stoffgebiete_dists[stoffgebiete_ab_str]
        elif stoffgebiete_ba_str in stoffgebiete_dists:
            distance = stoffgebiete_dists[stoffgebiete_ba_str]
        
        # get distance
        else:
            distance = get_multi_dist(stoffgebiete_a, stoffgebiete_b)

        stoffgebiete_dist[i, j] = distance
        stoffgebiete_dists[stoffgebiete_ab_str] = distance
        
stoffgebiete_dist = pd.DataFrame(stoffgebiete_dist)
stoffgebiete_dist = stoffgebiete_dist.fillna(stoffgebiete_dist.mean().mean())
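The double loop above avoids recomputing pairs by caching under a string key, which is useful because the same Stoffgebiet combination recurs across many texts. For distinct items, the same idea can be sketched by filling only the upper triangle and mirroring it; `dist_fn` here is a placeholder, not a function from this notebook:

```python
import numpy as np

def build_sym_dist(items, dist_fn):
    """Compute each unordered pair once and mirror it into both triangles."""
    n = len(items)
    out = np.zeros((n, n))  # diagonal stays 0 (distance to itself)
    for i in range(n):
        for j in range(i + 1, n):
            out[i, j] = out[j, i] = dist_fn(items[i], items[j])
    return out

m = build_sym_dist(['a', 'ab', 'abcd'], lambda x, y: abs(len(x) - len(y)))
```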
Test: examples of concrete texts and the computed distances¶
In [77]:
sample_index = meta_all.sample(n=10).index

meta_all[['author', 'title', 'stoffgebiet']].loc[sample_index]
Out[77]:
author title stoffgebiet
21 Sturm, Julius Ein Kunststück Militär/Krieg
1974 Miegel, Agnes Jane Gefangenschaft + Liebe
204 Rocholl, R. König Joram’s Abgötterei Religion
109 Lingg, Hermann Attilas Schwert Politik
570 Mautner, Eduard Admiral Tegetthoff Militär/Krieg + Tod
1918 George, Stefan Vom Ritter der sich verliegt Rittertum
1003 Dahn, Felix Lied Walthers von der Vogelweide Literatur
1270 Hamerling, Robert Der Brand Roms Brand
1521 Lingg, Hermann Der Kinder Kreuzfahrt Religion
640 Gruppe, Otto Friedrich Der Bauer und der Mohr Essen/Trinken
In [78]:
stoffgebiete_dist.loc[sample_index,sample_index]
Out[78]:
21 1974 204 109 570 1918 1003 1270 1521 640
21 0.0 9.5 13.0 2.0 4.0 17.0 13.0 8.0 13.0 10.0
1974 9.5 0.0 12.5 9.5 9.5 16.5 11.5 9.5 12.5 9.5
204 13.0 12.5 0.0 13.0 13.0 16.0 14.0 13.0 0.0 9.0
109 2.0 9.5 13.0 0.0 5.0 17.0 13.0 8.0 13.0 10.0
570 4.0 9.5 13.0 5.0 0.0 17.0 13.0 6.0 13.0 10.0
1918 17.0 16.5 16.0 17.0 17.0 0.0 18.0 17.0 16.0 13.0
1003 13.0 11.5 14.0 13.0 13.0 18.0 0.0 13.0 14.0 11.0
1270 8.0 9.5 13.0 8.0 6.0 17.0 13.0 0.0 13.0 10.0
1521 13.0 12.5 0.0 13.0 13.0 16.0 14.0 13.0 0.0 9.0
640 10.0 9.5 9.0 10.0 10.0 13.0 11.0 10.0 9.0 0.0
Test: how far, on average, are a text's subject areas (Stoffgebiete) from all other subject areas?¶
In [79]:
meta_all['dist_stoffgebiet_mean'] = stoffgebiete_dist.mean(axis = 1)

meta_all[[
    'stoffgebiet' ,'dist_stoffgebiet_mean'
]].sort_values(by = 'dist_stoffgebiet_mean').drop_duplicates('stoffgebiet')
Out[79]:
stoffgebiet dist_stoffgebiet_mean
917 Militär/Krieg + Treue/Gefolgschaft 5.810355
89 Militär/Krieg 5.861842
118 Militär/Krieg + Politik 6.083889
429 Militär/Krieg + Verbrechen 6.163345
1148 Sport + Militär/Krieg 6.171585
... ... ...
951 Nation/Volk-nD 16.253989
1303 Stadt 16.338656
2060 Adel 16.405791
1854 Rittertum 16.647833
351 Genie 16.692994

359 rows × 2 columns

Texts with subject area X (e.g. 'Militär/Krieg') are, in terms of subject areas, on average [dist_stoffgebiet_mean] away from all other texts.

Transform the distance matrix via MDS into 20 dimensions and add these (stoffgebiete_dim_1, stoffgebiete_dim_2, etc.) as features¶

In [80]:
# from sklearn.manifold import MDS
# from scipy.spatial import distance
# 
# n_components = 20
# random_states = range(17, 40)
# 
# for random_state in random_states:
#     model = MDS(n_components = n_components, random_state = random_state, dissimilarity = 'precomputed')
#     column_names = ['stoffgebiete_dim_' + str(i+1) for i in range(n_components)]
#     meta_all[column_names] = model.fit_transform(stoffgebiete_dist)
#     stoffgebiete_centroid = meta_all[column_names].mean()
#     for i, element in enumerate(meta_all.iloc):
#         meta_all.at[i, 'dist_stoffgebiet_centroid_cosine'] = distance.cosine(element[column_names], stoffgebiete_centroid)
#     print(f"random state  : {random_state}")
#     print(f"corr cosine : {meta_all[['dist_stoffgebiet_mean', 'dist_stoffgebiet_centroid_cosine']].corr().iloc[0,1]}")
#     print(f"\n")
In [81]:
n_components = 20

model = MDS(n_components = n_components, random_state = 24, dissimilarity = 'precomputed')
In [82]:
column_names = ['stoffgebiete_dim_' + str(i+1) for i in range(n_components)]
In [83]:
meta_all[column_names] = model.fit_transform(stoffgebiete_dist)
Test: model stress¶
In [84]:
stress = model.stress_
stress1 = np.sqrt(stress / (0.5 * np.sum(stoffgebiete_dist.values**2)))

print(f"sklearn stress   : {stress}")
print(f"Kruskal's stress : {stress1}")
sklearn stress   : 238485.3679399986
Kruskal's stress : 0.034216049330838744
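The conversion above relies on sklearn reporting the raw stress 0.5 · Σ(d̂ − d)² over the full symmetric matrix; dividing by 0.5 · Σd² and taking the square root yields Kruskal's stress-1. A toy check with made-up numbers:

```python
import numpy as np

# Toy target distances (symmetric, zero diagonal).
d = np.array([[0.0, 2.0, 4.0],
              [2.0, 0.0, 2.0],
              [4.0, 2.0, 0.0]])

d_hat = d.copy()
d_hat[d_hat > 0] += 0.1  # pretend embedding distances (hypothetical offset)

# raw stress as sklearn's MDS.stress_ reports it, then stress-1 as above.
raw_stress = 0.5 * np.sum((d_hat - d) ** 2)
stress1 = np.sqrt(raw_stress / (0.5 * np.sum(d ** 2)))
```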
Test: do the first two dimensions make subject-matter differences (here: Recht vs. Other) visible?¶
In [85]:
meta_all['concrete_stoffgebiet'] = ['Recht' if x == 'Recht' else 'Other' for x in meta_all['stoffgebiet']]

px.scatter(
    meta_all,
    x = 'stoffgebiete_dim_1',
    y = 'stoffgebiete_dim_2',
    color = 'concrete_stoffgebiet',
    hover_data = ['stoffgebiet']
)
Test: how does centrality according to the distance matrix (stoffgebiete_dist) relate to centrality according to the dimensional model? Is a text with a very low 'dist_stoffgebiet_mean' value also close to the centroid of the dimensional model?¶
In [86]:
stoffgebiete_centroid = meta_all[column_names].mean()

for i, element in enumerate(meta_all.iloc):
    meta_all.at[i, 'dist_stoffgebiet_centroid_manhattan'] = cityblock(element[column_names], stoffgebiete_centroid)
    meta_all.at[i, 'dist_stoffgebiet_centroid_euclidean'] = euclidean(element[column_names], stoffgebiete_centroid)
    meta_all.at[i, 'dist_stoffgebiet_centroid_cosine'] = cosine(element[column_names], stoffgebiete_centroid)
In [87]:
# dist_stoffgebiet_centroid_cosine depends heavily on random_state and is practically random (the correlation can even flip sign)

meta_all[[
    'dist_stoffgebiet_mean',
    'dist_stoffgebiet_centroid_manhattan', 'dist_stoffgebiet_centroid_euclidean', 'dist_stoffgebiet_centroid_cosine'
]].corr()
Out[87]:
dist_stoffgebiet_mean dist_stoffgebiet_centroid_manhattan dist_stoffgebiet_centroid_euclidean dist_stoffgebiet_centroid_cosine
dist_stoffgebiet_mean 1.000000 0.987770 0.993300 -0.824851
dist_stoffgebiet_centroid_manhattan 0.987770 1.000000 0.995385 -0.799814
dist_stoffgebiet_centroid_euclidean 0.993300 0.995385 1.000000 -0.799277
dist_stoffgebiet_centroid_cosine -0.824851 -0.799814 -0.799277 1.000000
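A plausible explanation for the erratic cosine column: metric MDS output is mean-centred, so the centroid of the embedded points is numerically the zero vector, and cosine distance to a near-zero vector carries no stable signal, while Manhattan and Euclidean distances simply become distances from the origin. A minimal illustration with random data (not the notebook's embedding):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
X = X - X.mean(axis=0)      # centre, as metric MDS output is

centroid = X.mean(axis=0)   # effectively the zero vector
print(np.linalg.norm(centroid))
```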
Transfer¶
In [88]:
features_used_add = pd.DataFrame({
    'feature' : column_names,
    'encoding_orig' : ['nominal_multi_sim'] * len(column_names),
    'encoding' : ['interval'] * len(column_names),
    'weight' : [1/len(column_names)] * len(column_names)
})
features_used_df = pd.concat(
    [features_used_df,
     features_used_add
    ]).reset_index(drop = True)

features_used_df = features_used_df.drop_duplicates(subset = 'feature')
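The weighting convention used in this and the following feature-table updates: each of the k derived columns gets weight 1/k, so one original feature contributes total weight 1 to downstream distance computations. In isolation (illustrative column names):

```python
import pandas as pd

# Three derived columns encoding one original feature, each weighted 1/3.
cols = ['stoffgebiete_dim_1', 'stoffgebiete_dim_2', 'stoffgebiete_dim_3']
add = pd.DataFrame({
    'feature': cols,
    'encoding_orig': ['nominal_multi_sim'] * len(cols),
    'encoding': ['interval'] * len(cols),
    'weight': [1 / len(cols)] * len(cols),
})
```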

stoffgebiet_bewertung [nominal_multi_dependent]¶

In [89]:
stoffgebiet_ratings['rating'] = stoffgebiet_ratings['rating'].astype(int)
In [90]:
all_types = stoffgebiet_ratings['type'].sort_values().unique()
all_ratings = stoffgebiet_ratings['rating'].sort_values().unique()
In [91]:
# Nominal approach

column_names = []
for this_type in all_types:
    for this_rating in all_ratings:
        column_names.append('stoffgebiet_bewertung_' + this_type + '_' + str(this_rating))

onehot_df = pd.DataFrame(columns = column_names)

for i, element in tqdm(enumerate(meta_all.iloc), total = meta_all.shape[0]):
    this_title = element.title
    this_author = element.author
    this_ratings = stoffgebiet_ratings.query("author == @this_author and title == @this_title")
    
    for rating in this_ratings.iloc:
        this_type = rating.type
        this_rating = rating.rating
        onehot_df.at[i, 'stoffgebiet_bewertung_' + this_type + '_' + str(this_rating)] = 1
In [92]:
# Ordinal approach

# stoffgebiet_ratings['rating'] = stoffgebiet_ratings['rating'].replace({3 : 0, 2 : -1})
# 
# column_names = []
# for this_type in all_types:
#     column_names.append('stoffgebiet_bewertung_' + this_type)
# 
# onehot_df = pd.DataFrame(columns = column_names)
# 
# for i, element in tqdm(enumerate(meta_all.iloc), total = meta_all.shape[0]):
#     this_title = element.title
#     this_author = element.author
#     this_ratings = stoffgebiet_ratings.query("author == @this_author and title == @this_title")
#     
#     for rating in this_ratings.iloc:
#         this_type = rating.type
#         this_rating = rating.rating
#         onehot_df.at[i, 'stoffgebiet_bewertung_' + this_type] = this_rating
In [93]:
onehot_df = onehot_df.fillna(0)
/var/folders/45/zsyytpq97xq280z_cvw88j240000gn/T/ipykernel_2519/3506148593.py:1: FutureWarning:

Downcasting object dtype arrays on .fillna, .ffill, .bfill is deprecated and will change in a future version. Call result.infer_objects(copy=False) instead. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`

In [94]:
if onehot_df.columns[0] not in meta_all.columns:
    meta_all = meta_all.join(onehot_df).copy()

meta_all[[
    'stoffgebiet',
    'stoffgebiet_bewertung',
    'stoffgebiet_bewertung_Militär/Krieg_1',
    'stoffgebiet_bewertung_Militär/Krieg_2',
    'stoffgebiet_bewertung_Politik_1',
]].sample(n=10)
Out[94]:
stoffgebiet stoffgebiet_bewertung stoffgebiet_bewertung_Militär/Krieg_1 stoffgebiet_bewertung_Militär/Krieg_2 stoffgebiet_bewertung_Politik_1
576 Liebe + Politik 1 + 1 0 0 1
1644 Militär/Krieg 1 1 0 0
442 Tod + Politik 1 + 2 0 0 0
1508 Militär/Krieg 3 0 0 0
153 Politik 1 0 0 1
166 Militär/Krieg + Politik 3 + 1 0 0 1
1060 Denkmal 1 0 0 0
1356 Jagd + Herrscherliches Handeln 0 + 3 0 0 0
767 Tod + Politik 1 + 0 0 0 0
821 Friede + Politik 2 + 1 0 0 1
In [95]:
column_names = onehot_df.columns

features_used_add = pd.DataFrame({
    'feature' : column_names,
    'encoding_orig' : ['nominal_multi_dependent'] * len(column_names),
    'encoding' : ['bin'] * len(column_names),
    'weight' : [1/len(column_names)] * len(column_names)
})
features_used_df = pd.concat(
    [features_used_df,
     features_used_add
    ]).reset_index(drop = True)

features_used_df = features_used_df.drop_duplicates(subset = 'feature')

entity_simple [nominal_multi]¶

In [96]:
meta_all['entity_simple'] = [str(x) for x in meta_all['entity_simple']]
data = meta_all['entity_simple']
In [97]:
onehot_df = OneHotMulti(data, 'entity_simple')
In [98]:
if onehot_df.columns[0] not in meta_all.columns:
    meta_all = meta_all.join(onehot_df).copy()

meta_all[[
    'entity_simple',
    'entity_simple_1',
    'entity_simple_2',
    'entity_simple_3',
]].sample(n=10)
Out[98]:
entity_simple entity_simple_1 entity_simple_2 entity_simple_3
1225 1 + 3 1 0 1
940 1 + 1 2 0 0
1894 3 + 3 0 0 2
1186 1 + 3 1 0 1
99 1 + 1 + 3 2 0 1
1189 3 + 2 0 1 1
971 1 + 1 2 0 0
344 1 + 1 + 3 + 3 2 0 2
373 1 1 0 0
1665 2 + 2 + 2 0 3 0
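OneHotMulti comes from resources_geschichtslyrik and is not shown here; judging from the sample above it counts how often each ' + '-separated label occurs per text (e.g. '1 + 1' → entity_simple_1 = 2). A rough pandas equivalent, as an assumption about its behaviour rather than its actual code:

```python
import pandas as pd

data = pd.Series(['1 + 3', '1 + 1', '2 + 2 + 2'])

# Split the ' + '-joined labels, then count occurrences per row.
counts = (
    data.str.split(' + ', regex=False)
        .explode()
        .groupby(level=0)
        .value_counts()
        .unstack(fill_value=0)
        .add_prefix('entity_simple_')
)
```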
In [99]:
column_names = onehot_df.columns

features_used_add = pd.DataFrame({
    'feature' : column_names,
    'encoding_orig' : ['nominal_multi'] * len(column_names),
    'encoding' : ['interval'] * len(column_names),
    'weight' : [1/len(column_names)] * len(column_names)
})
features_used_df = pd.concat(
    [features_used_df,
     features_used_add
    ]).reset_index(drop = True)

features_used_df = features_used_df.drop_duplicates(subset = 'feature')

entity_bewertung [nominal_multi_dependent]¶

In [100]:
entity_ratings['type'] = entity_ratings['type'].replace({'1 ' : '1'})
entity_ratings['rating'] = entity_ratings['rating'].astype(int)
In [101]:
all_types = entity_ratings['type'].unique()
all_rating_types = entity_ratings['rating'].unique()
In [102]:
# Nominal approach

column_names = []
for this_type in all_types:
    for this_rating_type in all_rating_types:
        column_names.append('entity_bewertung_' + this_type + '_' + str(this_rating_type))

onehot_df = pd.DataFrame(index = meta_all.index, columns = column_names)
onehot_df = onehot_df.fillna(0)

for i, element in tqdm(enumerate(meta_all.iloc), total = meta_all.shape[0]):
    this_title = element.title
    this_author = element.author
    this_ratings = entity_ratings.query("author == @this_author and title == @this_title")
    
    for rating in this_ratings.iloc:
        this_type = rating.type
        this_rating = rating.rating
        
        onehot_df.at[i, 'entity_bewertung_' + this_type + '_' + str(this_rating)] += 1
In [103]:
# Ordinal approach

# entity_ratings['rating'] = entity_ratings['rating'].replace({3 : 0, 2 : -1})
# 
# column_names = []
# for this_type in all_types:
#     column_names.append('entity_bewertung_' + this_type)
# 
# onehot_df = pd.DataFrame(columns = column_names)
# 
# for i, element in tqdm(enumerate(meta_all.iloc), total = meta_all.shape[0]):
#     this_title = element.title
#     this_author = element.author
#     this_ratings = entity_ratings.query("author == @this_author and title == @this_title")
#     
#     for this_type in this_ratings['type'].unique():
#         this_ratings_values = this_ratings.query("type == @this_type")['rating']
#         this_ratings_mean = this_ratings_values.mean()
#         onehot_df.at[i, 'entity_bewertung_' + this_type] = this_ratings_mean
In [104]:
onehot_df = onehot_df.fillna(0)
In [105]:
if onehot_df.columns[0] not in meta_all.columns:
    meta_all = meta_all.join(onehot_df).copy()

meta_all[[
    'entity_simple',
    'entity_bewertung',
    'entity_bewertung_1_1',
    'entity_bewertung_1_2',
    'entity_bewertung_3_1',
]].sample(n=10)
Out[105]:
entity_simple entity_bewertung entity_bewertung_1_1 entity_bewertung_1_2 entity_bewertung_3_1
1873 4 0 0 0 0
1017 1 + 1 + 1 + 3 + 1 3 + 2 + 2 + 0 + 0 0 2 0
1157 2 + 3 1 + 2 0 0 0
26 4 + 3 3 + 0 0 0 0
504 3 + 3 + 1 + 1 3 + 1 + 1 + 1 2 0 1
1688 4 1 0 0 0
1622 1 + 1 + 3 1 + 3 + 1 1 0 1
775 1 + 1 + 3 0 + 2 + 2 0 1 0
482 1 + 3 + 1 1 + 2 + 0 1 0 0
1185 1 + 2 3 + 3 0 0 0
In [106]:
column_names = onehot_df.columns

features_used_add = pd.DataFrame({
    'feature' : column_names,
    'encoding_orig' : ['nominal_multi_dependent'] * len(column_names),
    'encoding' : ['interval'] * len(column_names),
    'weight' : [1/len(column_names)] * len(column_names)
})
features_used_df = pd.concat(
    [features_used_df,
     features_used_add
    ]).reset_index(drop = True)

features_used_df = features_used_df.drop_duplicates(subset = 'feature')

marker_person: systematize (split into Titel/Text)¶

In [107]:
meta_all['marker_person'].value_counts().sort_index()
Out[107]:
marker_person
/               411
Text            622
Titel           118
Titel + Text    912
Name: count, dtype: int64
In [108]:
meta_all['marker_person_title'] = [1 if 'Titel' in x else 0 for x in meta_all['marker_person']]
meta_all['marker_person_text'] = [1 if 'Text' in x else 0 for x in meta_all['marker_person']]
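The same Titel/Text split recurs below for marker_zeit, marker_ort and marker_objekt; a small helper (hypothetical, not part of the notebook) would factor the pattern out:

```python
import pandas as pd

def split_marker(series: pd.Series, prefix: str) -> pd.DataFrame:
    """Map 'Titel', 'Text', 'Titel + Text' or '/' to two binary columns."""
    return pd.DataFrame({
        f'{prefix}_title': series.str.contains('Titel').astype(int),
        f'{prefix}_text': series.str.contains('Text').astype(int),
    })

demo = split_marker(pd.Series(['/', 'Text', 'Titel + Text']), 'marker_person')
```

Calling e.g. `split_marker(meta_all['marker_zeit'], 'marker_zeit')` would then replace the two list comprehensions per marker.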
In [109]:

features_used_add = pd.DataFrame({
    'feature' : ['marker_person_title', 'marker_person_text'],
    'encoding_orig' : ['bin_multi', 'bin_multi'],
    'encoding' : ['bin', 'bin'],
    'weight' : [0.5, 0.5]
})
features_used_df = pd.concat(
    [features_used_df,
     features_used_add
    ]).reset_index(drop = True)

features_used_df = features_used_df.drop_duplicates(subset = 'feature')

marker_zeit: systematize (split into Titel/Text)¶

In [110]:
meta_all['marker_zeit'].value_counts().sort_index()
Out[110]:
marker_zeit
/               1191
Text             773
Titel             48
Titel + Text      51
Name: count, dtype: int64
In [111]:
meta_all['marker_zeit_title'] = [1 if 'Titel' in x else 0 for x in meta_all['marker_zeit']]
meta_all['marker_zeit_text'] = [1 if 'Text' in x else 0 for x in meta_all['marker_zeit']]
In [112]:

features_used_add = pd.DataFrame({
    'feature' : ['marker_zeit_title', 'marker_zeit_text'],
    'encoding_orig' : ['bin_multi', 'bin_multi'],
    'encoding' : ['bin', 'bin'],
    'weight' : [0.5, 0.5]
})
features_used_df = pd.concat(
    [features_used_df,
     features_used_add
    ]).reset_index(drop = True)

features_used_df = features_used_df.drop_duplicates(subset = 'feature')

marker_ort: systematize (split into Titel/Text)¶

In [113]:
meta_all['marker_ort'].value_counts().sort_index()
Out[113]:
marker_ort
/               1506
Text             354
Titel             66
Titel + Text     137
Name: count, dtype: int64
In [114]:
meta_all['marker_ort_title'] = [1 if 'Titel' in x else 0 for x in meta_all['marker_ort']]
meta_all['marker_ort_text'] = [1 if 'Text' in x else 0 for x in meta_all['marker_ort']]
In [115]:

features_used_add = pd.DataFrame({
    'feature' : ['marker_ort_title', 'marker_ort_text'],
    'encoding_orig' : ['bin_multi', 'bin_multi'],
    'encoding' : ['bin', 'bin'],
    'weight' : [0.5, 0.5]
})
features_used_df = pd.concat(
    [features_used_df,
     features_used_add
    ]).reset_index(drop = True)

features_used_df = features_used_df.drop_duplicates(subset = 'feature')

marker_objekt: systematize (split into Titel/Text)¶

In [116]:
meta_all['marker_objekt'].value_counts().sort_index()
Out[116]:
marker_objekt
/                848
Text            1109
Titel             22
Titel + Text      84
Name: count, dtype: int64
In [117]:
meta_all['marker_objekt_title'] = [1 if 'Titel' in x else 0 for x in meta_all['marker_objekt']]
meta_all['marker_objekt_text'] = [1 if 'Text' in x else 0 for x in meta_all['marker_objekt']]
In [118]:

features_used_add = pd.DataFrame({
    'feature' : ['marker_objekt_title', 'marker_objekt_text'],
    'encoding_orig' : ['bin_multi', 'bin_multi'],
    'encoding' : ['bin', 'bin'],
    'weight' : [0.5, 0.5]
})
features_used_df = pd.concat(
    [features_used_df,
     features_used_add
    ]).reset_index(drop = True)

features_used_df = features_used_df.drop_duplicates(subset = 'feature')

ueberlieferung_bewertung [nominal]: one-hot encoding¶

In [119]:
meta_all['ueberlieferung_bewertung'] = [str(x) for x in meta_all['ueberlieferung_bewertung']]
data = meta_all['ueberlieferung_bewertung']
In [120]:
onehot_df = OneHotMulti(data, 'ueberlieferung_bewertung')
In [121]:
if onehot_df.columns[0] not in meta_all.columns:
    meta_all = meta_all.join(onehot_df).copy()

meta_all[[
    'ueberlieferung_bewertung',
    'ueberlieferung_bewertung_neutral',
    'ueberlieferung_bewertung_positiv',
    'ueberlieferung_bewertung_None',
]].sample(n=10)
Out[121]:
ueberlieferung_bewertung ueberlieferung_bewertung_neutral ueberlieferung_bewertung_positiv ueberlieferung_bewertung_None
1085 None 0 0 1
1051 positiv 0 1 0
945 None 0 0 1
1005 None 0 0 1
1477 positiv 0 1 0
1832 None 0 0 1
1624 None 0 0 1
1929 None 0 0 1
401 None 0 0 1
1064 neutral 1 0 0
In [122]:
column_names = onehot_df.columns

features_used_add = pd.DataFrame({
    'feature' : column_names,
    'encoding_orig' : ['nominal'] * len(column_names),
    'encoding' : ['bin'] * len(column_names),
    'weight' : [1/len(column_names)] * len(column_names)
})
features_used_df = pd.concat(
    [features_used_df,
     features_used_add
    ]).reset_index(drop = True)

features_used_df = features_used_df.drop_duplicates(subset = 'feature')

geschichtsauffassung_bewertung [nominal]: one-hot encoding¶

In [123]:
meta_all['geschichtsauffassung_bewertung'] = [str(x) for x in meta_all['geschichtsauffassung_bewertung']]
data = meta_all['geschichtsauffassung_bewertung']
In [124]:
onehot_df = OneHotMulti(data, 'geschichtsauffassung_bewertung')
In [125]:
if onehot_df.columns[0] not in meta_all.columns:
    meta_all = meta_all.join(onehot_df).copy()

meta_all[[
    'geschichtsauffassung_bewertung',
    'geschichtsauffassung_bewertung_positiv',
    'geschichtsauffassung_bewertung_negativ',
    'geschichtsauffassung_bewertung_None',
]].sample(n=10)
Out[125]:
geschichtsauffassung_bewertung geschichtsauffassung_bewertung_positiv geschichtsauffassung_bewertung_negativ geschichtsauffassung_bewertung_None
414 None 0 0 1
1310 None 0 0 1
1114 None 0 0 1
1196 None 0 0 1
752 None 0 0 1
1430 None 0 0 1
777 None 0 0 1
464 None 0 0 1
376 None 0 0 1
1812 None 0 0 1
In [126]:
column_names = onehot_df.columns

features_used_add = pd.DataFrame({
    'feature' : column_names,
    'encoding_orig' : ['nominal'] * len(column_names),
    'encoding' : ['bin'] * len(column_names),
    'weight' : [1/len(column_names)] * len(column_names)
})
features_used_df = pd.concat(
    [features_used_df,
     features_used_add
    ]).reset_index(drop = True)

features_used_df = features_used_df.drop_duplicates(subset = 'feature')

verhaeltnis_wissen [nominal_multi]: one-hot encoding¶

In [127]:
meta_all['verhaeltnis_wissen'] = [str(x) for x in meta_all['verhaeltnis_wissen']]
data = meta_all['verhaeltnis_wissen']
In [128]:
onehot_df = OneHotMulti(data, 'verhaeltnis_wissen')
In [129]:
if onehot_df.columns[0] not in meta_all.columns:
    meta_all = meta_all.join(onehot_df).copy()

meta_all[[
    'verhaeltnis_wissen',
    'verhaeltnis_wissen_übereinstimmend',
    'verhaeltnis_wissen_ergänzend',
    'verhaeltnis_wissen_abweichend_übernatürlich',
]].sample(n=10)
Out[129]:
verhaeltnis_wissen verhaeltnis_wissen_übereinstimmend verhaeltnis_wissen_ergänzend verhaeltnis_wissen_abweichend_übernatürlich
911 ergänzend 0 1 0
1359 abweichend_übernatürlich 0 0 1
1560 abweichend_übernatürlich 0 0 1
1774 ergänzend 0 1 0
798 abweichend_übernatürlich 0 0 1
485 ergänzend 0 1 0
1230 ergänzend 0 1 0
1551 übereinstimmend 1 0 0
1291 abweichend_übernatürlich 0 0 1
475 ergänzend 0 1 0
In [130]:
column_names = onehot_df.columns

features_used_add = pd.DataFrame({
    'feature' : column_names,
    'encoding_orig' : ['nominal_multi'] * len(column_names),
    'encoding' : ['bin'] * len(column_names),
    'weight' : [1/len(column_names)] * len(column_names)
})
features_used_df = pd.concat(
    [features_used_df,
     features_used_add
    ]).reset_index(drop = True)

features_used_df = features_used_df.drop_duplicates(subset = 'feature')

Overview¶

In [131]:
features_used_df = features_used_df.drop_duplicates(subset = 'feature')
In [132]:
features_used = features_used_df['feature'].tolist()
In [133]:
features_used_df
Out[133]:
feature encoding weight encoding_orig
0 geschichtslyrik ordinal 1.00 NaN
1 empirisch bin 1.00 NaN
2 theoretisch bin 1.00 NaN
3 sprechinstanz_markiert bin 1.00 NaN
4 konkretheit ordinal 1.00 NaN
... ... ... ... ...
1137 geschichtsauffassung_bewertung_ambivalent bin 0.20 nominal
1138 verhaeltnis_wissen_ergänzend bin 0.25 nominal_multi
1139 verhaeltnis_wissen_übereinstimmend bin 0.25 nominal_multi
1140 verhaeltnis_wissen_abweichend_übernatürlich bin 0.25 nominal_multi
1141 verhaeltnis_wissen_abweichend_natürlich bin 0.25 nominal_multi

1142 rows × 4 columns

In [134]:
meta_all.sample(n=10)[features_used]
Out[134]:
geschichtslyrik empirisch theoretisch sprechinstanz_markiert konkretheit vergangenheitsdominant zeitebenen fixierbarkeit anachronismus gegenwartsbezug ... ueberlieferung_bewertung_negativ geschichtsauffassung_bewertung_None geschichtsauffassung_bewertung_positiv geschichtsauffassung_bewertung_neutral geschichtsauffassung_bewertung_negativ geschichtsauffassung_bewertung_ambivalent verhaeltnis_wissen_ergänzend verhaeltnis_wissen_übereinstimmend verhaeltnis_wissen_abweichend_übernatürlich verhaeltnis_wissen_abweichend_natürlich
206 1.0 1.0 0.0 0.0 1.0 1.0 2.0 1.0 0.0 0.0 ... 0 1 0 0 0 0 0 0 1 0
1543 1.0 1.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 0.0 ... 0 1 0 0 0 0 1 0 0 0
1282 1.0 1.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 ... 0 1 0 0 0 0 1 0 0 0
1412 1.0 1.0 0.0 1.0 1.0 1.0 2.0 1.0 0.0 0.0 ... 0 1 0 0 0 0 0 0 1 0
333 1.0 1.0 0.0 0.0 1.0 1.0 2.0 1.0 0.0 0.0 ... 0 1 0 0 0 0 1 0 0 0
921 1.0 1.0 0.0 0.0 1.0 1.0 2.0 1.0 0.0 0.0 ... 0 1 0 0 0 0 1 0 0 0
1654 1.0 1.0 0.0 1.0 1.0 1.0 3.0 0.0 0.0 1.0 ... 0 1 0 0 0 0 0 0 1 0
131 1.0 1.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 ... 0 1 0 0 0 0 1 0 0 0
1592 1.0 1.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 ... 0 1 0 0 0 0 1 0 0 0
1256 1.0 1.0 0.0 1.0 1.0 1.0 2.0 0.0 0.0 0.0 ... 0 1 0 0 0 0 0 0 1 0

10 rows × 1142 columns

Rename and export¶

In [135]:
features_used_export_df = features_used_df.copy()
features_used_export_df['feature'] = ['vectortyp_' + x for x in features_used_export_df['feature']]
features_used_export_df.to_csv("../resources/more/vectors/vectordist_features.csv")
In [136]:
features_used_export_df
Out[136]:
feature encoding weight encoding_orig
0 vectortyp_geschichtslyrik ordinal 1.00 NaN
1 vectortyp_empirisch bin 1.00 NaN
2 vectortyp_theoretisch bin 1.00 NaN
3 vectortyp_sprechinstanz_markiert bin 1.00 NaN
4 vectortyp_konkretheit ordinal 1.00 NaN
... ... ... ... ...
1137 vectortyp_geschichtsauffassung_bewertung_ambivalent bin 0.20 nominal
1138 vectortyp_verhaeltnis_wissen_ergänzend bin 0.25 nominal_multi
1139 vectortyp_verhaeltnis_wissen_übereinstimmend bin 0.25 nominal_multi
1140 vectortyp_verhaeltnis_wissen_abweichend_übernatürlich bin 0.25 nominal_multi
1141 vectortyp_verhaeltnis_wissen_abweichend_natürlich bin 0.25 nominal_multi

1142 rows × 4 columns

In [137]:
export_meta = meta_all[['id'] + features_used]
export_meta.columns = ['vectortyp_' + x if x != 'id' else x for x in export_meta.columns]
export_meta.to_csv("../resources/more/vectors/vectordist.csv")
In [138]:
export_meta.head()
Out[138]:
id vectortyp_geschichtslyrik vectortyp_empirisch vectortyp_theoretisch vectortyp_sprechinstanz_markiert vectortyp_konkretheit vectortyp_vergangenheitsdominant vectortyp_zeitebenen vectortyp_fixierbarkeit vectortyp_anachronismus ... vectortyp_ueberlieferung_bewertung_negativ vectortyp_geschichtsauffassung_bewertung_None vectortyp_geschichtsauffassung_bewertung_positiv vectortyp_geschichtsauffassung_bewertung_neutral vectortyp_geschichtsauffassung_bewertung_negativ vectortyp_geschichtsauffassung_bewertung_ambivalent vectortyp_verhaeltnis_wissen_ergänzend vectortyp_verhaeltnis_wissen_übereinstimmend vectortyp_verhaeltnis_wissen_abweichend_übernatürlich vectortyp_verhaeltnis_wissen_abweichend_natürlich
0 1850.Grube.028 1.0 1.0 0.0 1.0 0.5 1.0 3.0 0.0 0.0 ... 0 1 0 0 0 0 1 0 0 0
1 1850.Kriebitzsch.001 1.0 0.0 1.0 1.0 0.0 0.5 2.0 0.0 0.0 ... 0 1 0 0 0 0 0 1 0 0
2 1850.Kriebitzsch.011 1.0 1.0 0.0 1.0 1.0 1.0 3.0 1.0 0.0 ... 0 1 0 0 0 0 1 0 0 0
3 1850.Kriebitzsch.019 1.0 1.0 0.0 1.0 1.0 1.0 3.0 1.0 0.0 ... 0 1 0 0 0 0 1 0 0 0
4 1851.Müller/Kletke.018 1.0 1.0 0.0 1.0 0.5 1.0 2.0 0.0 0.0 ... 0 1 0 0 0 0 0 1 0 0

5 rows × 1143 columns